A Data Augmentation Method for English-Vietnamese Neural Machine Translation

Authors

Abstract

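The translation quality of machine translation systems depends on the parallel corpus used for training, in particular the quantity and quality of the corpus. However, building a high-quality, large-scale parallel corpus is complex and expensive, particularly for a specific domain. Therefore, data augmentation techniques are widely used in machine translation. The input of the back-translation method is monolingual text, which is available from many sources, and therefore this method can be easily and effectively implemented to generate synthetic parallel data. In practice, monolingual texts are collected from different sources, and texts from websites often have errors in grammar and spelling, sentence mismatch, or freestyle writing. The quality of the output translation is therefore reduced, leading to a low-quality parallel corpus generated by back-translation. In this study, we propose a method for improving the quality of monolingual texts for back-translation. Moreover, the method is supplemented by pruning the translation table. We experimented with an English-Vietnamese neural machine translation system, using the IWSLT2015 dataset for training and testing in the legal domain. The results showed that the proposed method can effectively augment parallel data for machine translation, thereby improving translation quality. In our experimental cases, the BLEU score increased by 16.37 points compared with the baseline system.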
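As a rough illustration of the pipeline the abstract describes (clean the monolingual text first, then back-translate it into synthetic parallel data), the sketch below may help. It is not the paper's method: the filtering heuristics in `is_clean`, the thresholds, and the `fake_translate` stand-in for a trained target-to-source model are all assumptions made for this example, and the sample sentences are invented.

```python
import re
from typing import Callable, Iterable, List, Tuple


def is_clean(sentence: str, min_tokens: int = 3, max_tokens: int = 80) -> bool:
    """Heuristic quality filter (illustrative assumption, not the paper's criteria)."""
    tokens = sentence.split()
    if not (min_tokens <= len(tokens) <= max_tokens):
        return False
    # Reject lines that are mostly symbols, digits, or markup.
    letters = sum(ch.isalpha() for ch in sentence)
    if letters / max(len(sentence), 1) < 0.6:
        return False
    # Reject "freestyle" spellings with a character repeated 4+ times in a row.
    if re.search(r"(.)\1{3,}", sentence):
        return False
    return True


def back_translate(
    monolingual_target: Iterable[str],
    translate_to_source: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Build (synthetic source, real target) pairs from filtered target-side text."""
    pairs: List[Tuple[str, str]] = []
    seen = set()
    for raw in monolingual_target:
        sentence = " ".join(raw.split())  # normalize whitespace
        if sentence in seen or not is_clean(sentence):
            continue  # skip duplicates and noisy lines
        seen.add(sentence)
        pairs.append((translate_to_source(sentence), sentence))
    return pairs


# Hypothetical stand-in for a trained target-to-source NMT model.
def fake_translate(sentence: str) -> str:
    return "[synthetic] " + sentence


monolingual = [
    "The contract takes effect on the signing date .",
    "!!!! ???? 1234",                                    # mostly symbols
    "okkkkkk nha pls",                                   # freestyle spelling
    "The contract takes effect on the signing date .",  # duplicate
]
pairs = back_translate(monolingual, fake_translate)
```

In a real setup, `fake_translate` would be replaced by a reverse-direction translation model, and the resulting pairs would be mixed with the original parallel corpus before retraining.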

Similar resources

Data Augmentation for Low-Resource Neural Machine Translation

The quality of a Neural Machine Translation system depends substantially on the availability of sizable parallel corpora. For low-resource language pairs this is not the case, resulting in poor translation quality. Inspired by work in computer vision, we propose a novel data augmentation approach that targets low-frequency words by generating new sentence pairs containing rare words in new, syn...

Generation of Vietnamese for French-Vietnamese and English-Vietnamese Machine Translation

This paper presents the implementation of the Vietnamese generation module in ITS3, a multilingual machine translation (MT) system based on the Government & Binding (GB) theory. Despite well-designed generic mechanisms of the system, it turned out that the task of generating Vietnamese posed non-trivial problems. We therefore had to deviate from the generic code and make new design and implemen...

Pivoting Methods and Data for Czech-Vietnamese Translation via English

The statistical approach to machine translation (MT) relies heavily on large parallel corpora. For many language pairs, this can be a significant obstacle. A promising alternative is pivoting, i.e. making use of a third language to support the translation. There are a number of pivoting methods, but unfortunately, they were not evaluated in comparable settings. We focus on one particular langua...

Building A Training Corpus For Word Sense Disambiguation In English-To-Vietnamese Machine Translation

The most difficult task in machine translation is the elimination of ambiguity in human languages. A given word in English, as in Vietnamese, often has different meanings depending on its syntactic position in the sentence and the actual context. To resolve this ambiguity, people formerly resorted to many hand-coded rules. Nevertheless, manually building these rules ...

Dynamic Data Selection for Neural Machine Translation

Intelligent selection of training data has proven a successful technique to simultaneously increase training efficiency and translation performance for phrase-based machine translation (PBMT). With the recent increase in popularity of neural machine translation (NMT), we explore in this paper to what extent and how NMT can also benefit from data selection. While state-of-the-art data selection ...

Journal

Journal title: IEEE Access

Year: 2023

ISSN: 2169-3536

DOI: https://doi.org/10.1109/access.2023.3252898